Skeletal age assessed by TW2 using 20-bone, carpal and RUS score systems: Intra-observer and inter-observer agreement among male pubertal soccer players

The purpose of this study was to determine intra- and inter-observer agreement for the three skeletal ages (SAs) derived from the TW2 method among male pubertal soccer players. The sample included 142 participants aged 11.0–15.3 years. Films of the left hand-wrist were evaluated twice by each of two observers. Twenty bones were rated, and the three TW2 scoring systems were used to determine SA: 20-bone, CARPAL and RUS. Overall intra-observer agreement rates were 95.1% for Observer A and 93.8% for Observer B. Although agreement rates between observers differed for 13 bones (5 carpals, metacarpal-I, metacarpal-III, metacarpal-V, proximal phalanges-I, III and V, distal phalanx-III), inter-observer intra-class correlations were 0.990 (20-bone), 0.969 (CARPAL), and 0.988 (RUS). For the three SA protocols, bias was negligible: 0.02 years (20-bone), 0.04 years (CARPAL), and 0.03 years (RUS). Observer-associated error was not significant for 20-bone SA (TEM = 0.25 years, %CV = 1.86) or for RUS SA (TEM = 0.31 years, %CV = 2.22). Although the mean difference in CARPAL SA between observers was significant (Observer A: 12.48±1.18 years; Observer B: 12.29±1.24 years; t = 4.662, p<0.01), the inter-observer disagreement had little impact (TEM = 0.34 years, %CV = 2.78). Concordance of bone-specific developmental stages was somewhat more problematic for the carpals than for the long bones. Finally, when observer error does not exceed one stage and replicate assignments are equally likely to be lower or higher than the initial assignments, the effect on SAs is trivial or small.


Introduction
Growth, maturation and development are central processes in the long-term participation of children and youth in competitive sports. These processes are related, but they are not synonymous or interchangeable. Growth involves quantitative changes in body size, proportions, shape and composition [1]. Development implies changes in behavioral domains: cognitive, emotional, social and motor. Finally, maturation refers to progress towards the adult state, which varies with the biological system: dental, sexual, somatic and skeletal. In the context of youth sports, these concepts are implicit in the long-term athletic development model [2]. Two indicators of biological maturation are commonly used in studies of adolescent athletes. In boys, sexual maturation is assessed from genital and pubic hair development, which is limited to the pubertal years. Skeletal age (SA) requires a standard radiograph of the hand and wrist and is generally considered the preferred indicator because it can be assessed throughout the first two decades of life [3]. Several protocols are available to determine SA [4][5][6][7][8]; they are similar in principle and all require a radiograph of the hand-wrist. Advances in technology have reduced radiation exposure to about 0.001 millisievert (mSv), which is equivalent to three hours of television [3]. Briefly, the Greulich-Pyle [4] and Fels [5] methods were developed on American children and adolescents, while the Tanner-Whitehouse method [6] was originally based on British youth and was subsequently modified [7,8].
Observations on Portuguese soccer players aged 11-12 years suggest that players delayed and advanced in skeletal maturity status based on Fels SAs were equally represented, whereas among players aged 13-14 years, late maturing players were under-represented and those classified as average and early were over-represented [9]. These observations are consistent with soccer being highly selective, and the literature suggests a gradient in body size among Portuguese male soccer players aged 13.0-14.1 years [10]: those selected for a regional team were taller, heavier, advanced in SA given by the Fels method and had more playing experience than teammates who were not selected. Meanwhile, among French youth players [11], those who signed a professional contract and played at least one game as a professional were significantly taller and heavier and had a higher estimated aerobic power at baseline (~13 years) than peers who did not sign a professional contract, although the groups did not differ in skeletal maturity assessed by the Greulich-Pyle protocol. Finally, a survey of the skeletal maturity status of Serbian soccer players aged 14 years using the Tanner-Whitehouse method, specifically the radius-ulna-short bones (RUS) protocol, noted that late maturing players were more likely to attain a professional career than early maturing peers [12].
The youth soccer studies cited above produced different results, which highlights the need to discuss whether findings based on different methods of SA assessment can be generalized. In fact, results may reflect variation in the methods used to determine SA and/or specific characteristics of youth soccer in Portugal, France and Serbia. The literature has already examined the agreement of concurrent protocols for SA determination, particularly during the pubertal years, which overlap with selection, specialization into playing positions [10] and vulnerability to sport injuries [13]. For example, the SAs of 40 male Spanish soccer players aged 12.5-16.1 years were assessed with the TW3 and Fels methods [14], and TW3 RUS SAs were consistently lower than Fels SAs. More recently [15], two versions of the TW RUS method (TW2 versus TW3) were compared in a large international sample of male soccer players aged 10.9-17.9 years. Across the chronological age (CA) range of the sample, TW3 RUS SAs were consistently lower than TW2 RUS SAs. These studies have implications for the classification of youth players by maturity status. Advances in digital imaging technologies combined with research on machine learning have led to software applications that automatically estimate SA from digitized radiographs [16]. Meanwhile, sonography has been proposed as an alternative non-invasive method for determining SA [17], although it introduces the operator who acquires the image as an additional source of error.
Given the above, error is a central issue in the determination of SA. In the study of Spanish players [14], intra-observer differences for Fels and TW3 SAs ranged from 0.1 to 0.4 years, while technical errors of measurement were small: 0.04 year for Fels SA and 0.06 year for TW3 SA. The objective of the present study was to evaluate intra-observer and inter-observer agreement for the three SAs derived by the TW2 method among male adolescent soccer players: the 20-bone protocol (TW2 20-bone SA), the carpals (CARPAL SA), and the 13 long bones (RUS SA). It was hypothesized that even trained observers produce errors in the assessment of TW2 SAs.

Procedures
The present study is derived from the PRONTALSPORT Project (Growth, maturation and athletic performance in pubertal athletes). The project followed the ethical standards established for sports sciences [18] and was approved by the Ethics Committee for Sports Sciences of the University of Coimbra (CE/FCDEF-UC/00122014). Participants were recruited from clubs in the Portuguese Midlands that had a written agreement with the University of Coimbra. Parents of the players signed an informed consent form, and the players provided assent. They were informed that their participation was voluntary and that they could withdraw at any time. All data were collected within a 2-week period: anthropometry was measured at the Coimbra University Stadium, and posterior-anterior radiographs of the left hand-wrist were obtained on the same day at a certified clinic.

Sample
The sample included 142 male adolescent soccer players aged 11.0-15.3 years. All participants were registered with the Portuguese Football Federation in the infantile and initiate categories. The clubs competed in a 9-month tournament (from mid-September to late May). In general, clubs held 3-5 training sessions per week (90-120 minutes each) and played one match per week (usually on Saturdays or Sundays).

Chronological and skeletal ages
CA was calculated as the difference between the date of the visit to the clinic and the birthdate. The films of the left hand-wrist were evaluated twice by each of the two observers. Observer A (first author) completed a 3-year BSc in Sport Sciences and a 2-year MSc in Youth Sports, which included a 27-hour course on biological maturation. Observer A subsequently enrolled in a PhD programme and completed 45 hours of training in the assessment of skeletal age, including 100 assessments using different methods to determine SA. Over the four years before assessing the radiographs of the current study, Observer A determined SA in more than 1000 cases. Observer B (second author) has been a Professor at the University of Coimbra for 27 years, was trained in the determination of SA by the last author more than 20 years ago, and has assessed more than 5000 films using the Greulich-Pyle, Tanner-Whitehouse and Fels protocols. Each observer repeated the assessments one month later.
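Decimal CA is usually obtained by counting the days between the two dates and dividing by the mean length of a year. The following is a minimal Python sketch, assuming a 365.25-day year (the convention actually used in the study is not stated) and a hypothetical function name `chronological_age`:

```python
from datetime import date


def chronological_age(birth: date, visit: date) -> float:
    """Return decimal chronological age in years (visit date minus birthdate).

    Assumes a mean year of 365.25 days; this is an assumption of the sketch,
    not a detail reported in the study.
    """
    if visit < birth:
        raise ValueError("visit date precedes birthdate")
    return (visit - birth).days / 365.25


if __name__ == "__main__":
    # Hypothetical player born 1 March 2003 and radiographed 15 October 2015
    print(round(chronological_age(date(2003, 3, 1), date(2015, 10, 15)), 2))
```

For this hypothetical player the sketch prints 12.62 (4611 days divided by 365.25).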
The TW method, version 2 (TW2), was used to assess skeletal age [7]. The method is based on matching each bone on the radiograph with the verbally described criteria for the stages specific to that bone. Twenty bones were rated: 13 long bones (radius, ulna, and the metacarpals and proximal, middle and distal phalanges of the first, third and fifth digits) and seven carpals (excluding the pisiform). Stages were essentially the same as in the original TW version [6]. The TW2 version provides three scoring systems, each yielding its own SA: 20-bone, CARPAL and RUS. A specific point score is assigned to each stage of each individual bone. The scores for the bones are summed to give a skeletal maturity score, which ranges from zero (immaturity) to 1000 (maturity). In the 20-bone system, the carpal and RUS bones were somewhat arbitrarily weighted so that each group contributed 50% of the total skeletal maturity score and the overall differences between bones within each group were minimized. Finally, sex-specific tables convert the total score for a given system (20-bone, RUS or CARPAL) into an individual SA. As noted, 1000 points indicates the skeletally mature state, and an SA is not assigned to individuals who are skeletally mature [3].
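The scoring logic described above (stage, then points, then summed maturity score, then SA lookup, with no SA at 1000 points) can be illustrated with a small Python sketch. Everything numeric here is hypothetical: the point values, the three-bone subset and the conversion table are invented for illustration and are not the TW2 tables. Real TW2 tables are printed as sex-specific score-to-age look-ups; the linear interpolation used here is a simplification.

```python
from bisect import bisect_right
from typing import Dict, Optional, Sequence, Tuple

# HYPOTHETICAL stage-to-points values, for illustration only. The real TW2
# tables assign sex-specific points to every stage of each of the 20 bones.
HYPOTHETICAL_POINTS: Dict[str, Dict[str, int]] = {
    "radius": {"E": 30, "F": 45, "G": 60},
    "ulna": {"E": 25, "F": 40, "G": 55},
    "capitate": {"E": 50, "F": 65, "G": 80},
}

# HYPOTHETICAL score-to-SA table as (maturity score, SA in years), sorted by score.
HYPOTHETICAL_SA_TABLE: Sequence[Tuple[float, float]] = [
    (0, 5.0), (100, 10.0), (200, 15.0), (999, 18.0),
]


def maturity_score(stages: Dict[str, str], points: Dict[str, Dict[str, int]]) -> int:
    """Sum the points for the stage assigned to each rated bone."""
    return sum(points[bone][stage] for bone, stage in stages.items())


def score_to_sa(score: float, table: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Convert a maturity score to SA by linear interpolation in a lookup table.

    Returns None for a score of 1000 (skeletally mature), because no SA is
    assigned to mature individuals.
    """
    if score >= 1000:
        return None
    scores = [s for s, _ in table]
    if not scores[0] <= score <= scores[-1]:
        raise ValueError("score outside the range of the conversion table")
    i = bisect_right(scores, score)
    if i == len(scores):  # score equals the last tabulated score
        return table[-1][1]
    (s0, a0), (s1, a1) = table[i - 1], table[i]
    return a0 + (a1 - a0) * (score - s0) / (s1 - s0)


if __name__ == "__main__":
    stages = {"radius": "F", "ulna": "G", "capitate": "E"}
    total = maturity_score(stages, HYPOTHETICAL_POINTS)
    print(total, score_to_sa(total, HYPOTHETICAL_SA_TABLE))
```

With these invented values the example prints `150 12.5`. Keeping the points and the conversion table as data, rather than hard-coding them, makes it easy to swap in the published sex-specific 20-bone, CARPAL or RUS tables.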

Analyses
Frequencies of bone-specific developmental stages were presented separately for each occasion (time moment 1; time moment 2) for Observer A and Observer B. Rates of intra-observer agreement were calculated for each individual bone and for the total of observations (142 participants × 20 bones = 2840 observations). Discrepancies in stages between time moments were coded as -2, -1, +1 or +2 (time moment 2 minus time moment 1). Intra-observer mean differences were also tested with paired t-tests, separately for bone-specific scores (points), for the three systems (20-bone, RUS, CARPAL) and for the respective SAs. These analyses were done separately for Observer A and Observer B. Based on time moment 2 for each observer, similar analyses were done to examine inter-observer variation in assessments. Technical errors of measurement (TEM), coefficients of variation (%CV) and intra-class correlation coefficients (ICC) were calculated. Effect sizes were calculated as d-values [19] and interpreted as follows [20]: d < 0.20 (trivial), 0.20 ≤ d < 0.60 (small), 0.60 ≤ d < 1.20 (moderate), 1.20 ≤ d < 2.00 (large), 2.00 ≤ d < 4.00 (very large), and d ≥ 4.00 (nearly perfect). Analyses using SAs (20-bone SA, CARPAL SA, RUS SA) as dependent variables were limited to participants who were not skeletally mature. The significance level was set at 5%. Analyses were performed using the Statistical Package for the Social Sciences version 26.0 (SPSS Inc., IBM Company, Armonk, NY, USA) and GraphPad Prism (version 5 for Windows, GraphPad Software, San Diego, California, USA, www.graphpad.com).
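The reliability statistics listed above can be sketched in plain Python. The formulas are common textbook forms, but the specific variants are assumptions rather than details reported in the study: TEM as √(Σd²/2n), %CV relative to the grand mean of both occasions, ICC(2,1) (two-way random effects, absolute agreement, single measures; the SPSS model actually chosen is not stated), and d computed with the pooled SD of the two occasions (the formula from reference [19] is not given in the text). All function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def tem(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Technical error of measurement: sqrt(sum of squared differences / 2n)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)) / (2 * len(x1)))


def cv_percent(x1: Sequence[float], x2: Sequence[float]) -> float:
    """%CV: TEM as a percentage of the grand mean of both occasions."""
    return 100 * tem(x1, x2) / mean(list(x1) + list(x2))


def paired_t(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Paired t statistic for the mean of (x2 - x1)."""
    d = [b - a for a, b in zip(x1, x2)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def icc_2_1(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC(2,1)."""
    n, k = len(x1), 2
    rows = [[a, b] for a, b in zip(x1, x2)]
    grand = mean(list(x1) + list(x2))
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_cols = n * sum((mean(col) - grand) ** 2 for col in (x1, x2))
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


def cohens_d(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Mean difference divided by the pooled SD of the two occasions."""
    pooled = math.sqrt((stdev(x1) ** 2 + stdev(x2) ** 2) / 2)
    return (mean(x2) - mean(x1)) / pooled


def magnitude(d: float) -> str:
    """Label |d| with the thresholds listed in the Analyses section."""
    for limit, label in [(0.2, "trivial"), (0.6, "small"), (1.2, "moderate"),
                         (2.0, "large"), (4.0, "very large")]:
        if abs(d) < limit:
            return label
    return "nearly perfect"


if __name__ == "__main__":
    # Made-up scores for four participants on two occasions
    t1 = [10, 20, 30, 40]
    t2 = [12, 18, 30, 44]
    print(round(tem(t1, t2), 3), round(icc_2_1(t1, t2), 3), magnitude(cohens_d(t1, t2)))
```

On these invented scores the sketch prints `1.732 0.984 trivial`. Results from SPSS may differ slightly if a different ICC model or effect-size formula was used.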

Results
Developmental stages for each of the 20 bones at time moment 1 and time moment 2 for each observer are summarized in Table 1. Intra-observer agreement was 95.1% for Observer A and 93.8% for Observer B. Intra-observer discrepancies (time moment 2 minus time moment 1) were evenly distributed: 70 negative (2.5%) and 69 positive (2.4%) for Observer A, and 91 negative (3.2%) and 85 positive (3.0%) for Observer B. Technical errors of measurement, coefficients of variation and intra-class correlations for Observer A are summarized in Table 2. For the 20-bone system, the mean difference between time moments was significant for the capitate (t = 2.022, p<0.05), although the %CV was less than 5% and the ICC was 0.823. Bone-specific ICCs ranged from 0.823 (capitate) to 0.993 (distal phalanx-I); the ICC for the 20-bone score was 0.997 (TEM = 8.01, %CV = 0.95). In the CARPAL protocol, the capitate was again the only bone with a significant intra-observer mean difference (t = 2.022, p<0.05; TEM = 6.41, %CV = 3.03; ICC = 0.834). In contrast, variation in the CARPAL score was negligible (TEM = 8.99, %CV = 0.97; ICC = 0.993). For the RUS protocol, no mean differences were significant, and the ICC for the RUS score was 0.997 (TEM = 13.92, %CV = 2.60). Intra-observer agreement for Observer B on the three scoring systems is summarized in Table 3. Overall, ICCs were acceptable for each system: 20-bone (TEM = 9.68, %CV = 1.15; ICC = 0.996), CARPAL (TEM = 12.95, %CV = 1.32; …). Inter-observer comparisons are summarized in Table 4. Overall, agreement rates were greater than 80% with the 20-bone protocol, but significant differences were noted for 13 bones (5 carpals, metacarpal-I, metacarpal-III, metacarpal-V, proximal phalanges-I, III and V, distal phalanx-III). Bone-specific ICCs ranged from 0.791 to 0.974. The lack of concordance between observers was similar for the CARPAL and RUS systems.
Divergence between observers was noted for four of the seven carpals and for eight of the 13 bones in the RUS system. However, the ICCs for the total scores of each system were 0.990 (20-bone), 0.969 (CARPAL) and 0.988 (RUS).
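The agreement rate and the signed discrepancy counts reported above follow from simple tallying of stage letters. The following is a minimal Python sketch, assuming stages are coded by their TW2 letters and that a discrepancy is the stage index at time moment 2 minus that at time moment 1. The function name is hypothetical, and the demonstration ratings are synthetic: they reproduce only Observer A's reported totals, not the actual film ratings.

```python
from collections import Counter
from typing import List, Tuple

STAGES = "ABCDEFGHI"  # TW2 stage letters, ordered from least to most mature


def discrepancy_summary(first: List[str], second: List[str]) -> Tuple[float, Counter]:
    """Return the agreement rate and a tally of signed stage discrepancies.

    Discrepancy = stage index at time moment 2 minus stage index at time
    moment 1, so -1 means the replicate rating was one stage lower.
    """
    if len(first) != len(second):
        raise ValueError("both occasions must rate the same bones")
    diffs = [STAGES.index(b) - STAGES.index(a) for a, b in zip(first, second)]
    tally = Counter(d for d in diffs if d != 0)
    agreement = diffs.count(0) / len(diffs)
    return agreement, tally


if __name__ == "__main__":
    # Synthetic ratings matching Observer A's totals: 2840 ratings,
    # 70 one stage lower and 69 one stage higher on replication.
    t1 = ["F"] * 2840
    t2 = ["E"] * 70 + ["G"] * 69 + ["F"] * 2701
    rate, tally = discrepancy_summary(t1, t2)
    print(f"{100 * rate:.1f}%", dict(tally))
```

This prints `95.1% {-1: 70, 1: 69}`, matching the 95.1% reported for Observer A (2701 of 2840 ratings unchanged).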

Discussion
The present study evaluated intra-observer agreement for SA assessments on two independent occasions using the TW2 20-bone, CARPAL and RUS protocols among male soccer players 11-15 years of age. Overall agreement between the two time moments was acceptable for the three systems. Discrepancies did not exceed one stage, and there was no consistent tendency for the replicate assessment to be higher or lower than the initial assessment. With the 20-bone protocol, bone-specific %CVs were always below 5% for one observer and exceeded 5% for only three bones for the other. Disagreement was slightly greater for the CARPAL and RUS protocols, which are based on fewer bones. This likely reflects the scoring system: because the 20-bone, CARPAL and RUS protocols are each based on a 1000-point scale, each bone carries more points in the CARPAL and RUS systems than in the 20-bone system, so a one-stage discrepancy has a larger effect on the total score. Nevertheless, allowing for a few problematic bones, intra-observer agreement was acceptable both for scores and for assigned SAs.
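The scaling argument can be made concrete with simple arithmetic. The sketch below assumes, as stated in the Procedures, that carpals and RUS bones each carry 50% of the 1000-point 20-bone score, and that the CARPAL and RUS systems are each scaled to 1000 points. It computes only average points per bone. Actual TW2 point increments vary by bone and stage, so this is an illustration rather than a property of any specific bone.

```python
# (number of bones, total points those bones carry) in each scoring context
SYSTEMS = {
    "carpal within 20-bone": (7, 500),     # carpals carry 50% of the 20-bone score
    "RUS bone within 20-bone": (13, 500),  # RUS bones carry the other 50%
    "carpal within CARPAL": (7, 1000),
    "RUS bone within RUS": (13, 1000),
}


def mean_points_per_bone(n_bones: int, total_points: float) -> float:
    """Average points per bone if the bones shared the total equally."""
    return total_points / n_bones


if __name__ == "__main__":
    for name, (n, total) in SYSTEMS.items():
        print(f"{name}: {mean_points_per_bone(n, total):.1f} points per bone")
```

On average, a carpal carries about 71 points in the 20-bone system but about 143 points in the CARPAL system. Under these assumptions, the same one-stage discrepancy therefore weighs roughly twice as much in the CARPAL total as in the 20-bone total.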
The TW2 protocol has since been updated (TW3), and the TW method has been used in the sports sciences [11,21,22]. The original version (TW1) was developed on a British sample of average socioeconomic status [6]. The scores were designed to represent biological weights for each of the 20 bones, and the overall score was obtained by summing the scores of the 20 bones. Specific tables were used to convert the 20-bone score into an SA (20-bone TW1-SA). The first revision of the method (TW2) retained the verbal criteria for the respective stages of the 20 bones with a few refinements [7]: for the radius, stage J was deleted; for the ulna, stage I was deleted; and for five carpals (capitate, triquetral, lunate, scaphoid, trapezoid), the final stage I was deleted. This revision also changed the scores associated with each stage. Three sex-specific maturity scores were developed, each yielding an SA: carpals (CARPAL TW2-SA), radius, ulna and short bones (RUS TW2-SA), and the 20-bone TW2-SA. The most recent revision of the TW protocol (TW3) incorporated several additional samples of children and adolescents to revise the tables for converting CARPAL and RUS scores into SAs [8]. The British samples of the initial study, dating from the 1950s, were retained, while samples from Belgium (Leuven Growth Study in the 1970s) [23], Spain (Bilbao in the 1980s) [24], Japan (Tokyo in 1986) [25], Italy (north of Italy) [26], Argentina (La Plata in the 1970s) [27], and the U.S. (Texas, European-American ancestry) [28] were added.
Table 2 note: TEM (technical error of measurement); %CV (coefficient of variation); ICC (intra-class correlation coefficient). https://doi.org/10.1371/journal.pone.0271386.t002

Table 3. Descriptive statistics (mean ± standard deviation) for each bone score with the three scoring systems (TW2 20-bone, Carpal, RUS) assigned by observer B on two occasions (time moment 1 versus time moment 2), paired t-tests, effect sizes, technical errors of measurement, coefficients of variation and intra-class correlation coefficients in 142 adolescent male soccer players.
The specific stages and corresponding scores were the same as in TW2, but the TW3 revision deleted the 20-bone SA. As such, TW3 includes only the sex-specific tables for CARPAL TW3-SA and RUS TW3-SA. In addition, skeletal maturity with the RUS TW3 protocol is attained at 16.5 years in males and 15.0 years in females. In the preceding versions of the TW method, the pre-mature state (999 points) for males corresponded to an SA of 17.9 years with the TW2 20-bone, 18.1 years with the TW2 RUS and 14.9 years with the TW2 CARPAL scoring protocols.
Early studies reporting intra-observer agreement for the TW2 method date to the 1970s. In a sample of 122 Swedish boys and 90 girls aged 1 month to 7 years, replicate assessments had an overall agreement rate of about 80% [29]. Among 3817 Danish schoolchildren aged 7 to 18 years, TW 20-bone scores largely matched the British reference [30]. In that study, 90 radiographs were rated twice, and agreement rates were 88-89% for the long bones and 84-96% for the short bones. Because the carpals reach their final stages at earlier ages than the long bones, the Danish study examined carpal agreement on radiographs of boys aged 7-13 years and girls aged 7-11 years (60 cases in total), obtaining agreement rates of 82-93%.
Agreement of TW2 assessments among three observers has also been examined [31]. Two sets of radiographs from the Harpenden Longitudinal Growth Study and the Leuven Longitudinal Study of Belgian Boys, presented in random order, were used to test agreement between observers. Significant differences in mean SA between observers were found for 20-bone SA and CARPAL SA, but not for RUS SA. In the present study, after converting scores to SAs, inter-observer mean differences were not significant for TW2 20-bone and TW2 RUS SAs. In contrast, the inter-observer difference with the TW2 CARPAL protocol was a source of error, with 15 cases exceeding the limits of agreement. Among 110 Danish children and adolescents aged 6-16 years [32], intra-observer agreement ranged from 82% to 100% and, consistent with the current study, disagreements did not exceed one stage, with the capitate identified as the most problematic bone.
Table 3 note: TEM (technical error of measurement); %CV (coefficient of variation); ICC (intra-class correlation coefficient). https://doi.org/10.1371/journal.pone.0271386.t003
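The section reports bias and cases exceeding the limits of agreement without giving formulas. Assuming the usual Bland-Altman approach (bias = mean of the A minus B differences; 95% limits = bias ± 1.96 SD of the differences), both quantities can be sketched in Python as follows. The function names are hypothetical, and the demonstration data are made up.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def limits_of_agreement(sa_a: Sequence[float], sa_b: Sequence[float],
                        z: float = 1.96) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for the differences A - B."""
    diffs = [a - b for a, b in zip(sa_a, sa_b)]
    bias = mean(diffs)
    half_width = z * stdev(diffs)  # sample SD of the differences
    return bias, bias - half_width, bias + half_width


def count_outside(sa_a: Sequence[float], sa_b: Sequence[float], z: float = 1.96) -> int:
    """Count participants whose A - B difference falls outside the limits."""
    _, lower, upper = limits_of_agreement(sa_a, sa_b, z)
    return sum(1 for a, b in zip(sa_a, sa_b) if not lower <= a - b <= upper)


if __name__ == "__main__":
    # Made-up SAs (years) for five players rated by two observers
    obs_a = [12.0, 12.5, 13.0, 11.8, 14.2]
    obs_b = [12.1, 12.4, 13.0, 11.6, 14.3]
    bias, lower, upper = limits_of_agreement(obs_a, obs_b)
    print(round(bias, 2), round(lower, 2), round(upper, 2), count_outside(obs_a, obs_b))
```

Plotting each difference against the mean of the two observers' SAs, with horizontal lines at the bias and the two limits, gives the familiar Bland-Altman figure.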

Table 4. Descriptive statistics (mean ± standard deviation) for each bone score in the three scoring systems (TW2 20-bone, Carpal, RUS) assigned by observers A and B, paired t-tests, effect sizes, technical errors of measurement, coefficients of variation and intra-class correlation coefficients in 142 adolescent male soccer players.
Inter-observer agreement rates for TW SA assessments are reported less frequently in the literature than intra-observer differences. In a study of Dutch children [33], 60 radiographs of boys aged 10 through 16 years were rated with the TW protocol by an expert and a Dutch author. Agreement was 83% for the ulna and 66% for the radius, with a systematic disagreement concentrated in the assessment of stage F. This prompted the authors to hypothesize a differential impact of observer expertise among youth 10-12 years of age. In the present study of soccer players, disagreements between observers that exceeded the limits of agreement were concentrated between 11 and 13 years for TW2 CARPAL SA and between 11.5 and 14.0 years for TW2 20-bone SA (see Fig 1).
The literature on the skeletal maturity status of youth soccer players has consistently shown that the sport tends to favor early maturing players as they transition into the adolescent years [3,9]. A band of ±1.0 year is commonly used to classify players as late (SA more than 1.0 year below CA), average (SA within ±1.0 year of CA) or early maturing (SA more than 1.0 year above CA). In the present study, based on the assessments of Observer A (first author), early maturing players represented 36% at time moment 1 (TM1) and 37% at TM2 with TW2 20-bone SA, suggesting that intra-observer error had only a marginal impact on the frequencies of maturity status. With TW2 RUS SA, 49% and 50% of the participants were classified as advanced at TM1 and TM2, respectively. In contrast, only 8% and 11% of players were classified as advanced with TW2 CARPAL SA. Thus, although the proportion classified as advanced differed markedly between protocols, intra-observer error did not appear to influence maturity status classifications within a protocol.
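The ±1.0-year classification can be expressed as a short Python function. This is a minimal sketch under two stated assumptions: a difference of exactly 1.0 year is treated as "average", and players with no assigned SA (skeletally mature) get a separate "mature" label (in the study such players were simply excluded from SA-based analyses). The function names are hypothetical, and the example data are made up.

```python
from typing import Dict, Iterable, Optional, Tuple


def maturity_status(sa: Optional[float], ca: float, band: float = 1.0) -> str:
    """Classify maturity status from SA - CA using a +/- band (default 1.0 year).

    A difference exactly equal to the band counts as 'average' here; this
    boundary rule is an assumption of the sketch.
    """
    if sa is None:  # skeletally mature: no SA assigned
        return "mature"
    diff = sa - ca
    if diff > band:
        return "early"
    if diff < -band:
        return "late"
    return "average"


def status_percentages(pairs: Iterable[Tuple[Optional[float], float]]) -> Dict[str, float]:
    """Percentage of players in each maturity category, from (SA, CA) pairs."""
    labels = [maturity_status(sa, ca) for sa, ca in pairs]
    return {lab: 100 * labels.count(lab) / len(labels) for lab in sorted(set(labels))}


if __name__ == "__main__":
    # Made-up (SA, CA) pairs in years
    players = [(14.5, 13.0), (12.5, 13.0), (11.5, 13.0), (13.0, 13.0)]
    print(status_percentages(players))
```

Running the same function on the SAs from each time moment, and from each protocol, gives the kind of percentages compared in the paragraph above.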
The present study highlights the role of observer expertise in SA assessment with the TW2 protocol among adolescent soccer players. The study is novel in that it considers intra-observer agreement for each bone in addition to the three protocols (20-bone, CARPAL, RUS), using both scores and assigned SAs as dependent variables. Nevertheless, a few limitations should be considered. First, the study was limited to two observers, whereas inter-examiner agreement is essential in research projects that use more than two examiners. Additionally, the results are limited to a sample of 142 male soccer players 11-15 years of age. Given the CA range, it was not possible to evaluate early stages for specific bones, e.g., stages B-E for the radius, capitate, hamate and distal phalanx III; B-D for the triquetral, lunate, metacarpals II-V, proximal phalanges I-V, and distal phalanges I and V; and B-C for the ulna, scaphoid, trapezium, trapezoid, and metacarpal I. Additional research on pre-adolescents is therefore needed, especially for the CARPAL protocol. Note that the current sample included 25 participants who were classified as skeletally mature and, as such, were not included in the calculations illustrated in Fig 1. Moreover, the literature generally considers the stage descriptors for the round bones (carpals) more difficult to apply than those for the long bones and, as noted, the capitate has previously been identified as problematic [31,34,35]. The carpals are more difficult to evaluate because they involve assessments of shape and radiopaque lines or zones, whereas assessments of the long bones tend to concentrate on the centers of ossification and on epiphyseal-diaphyseal relationships and fusion [36].
Table 4 note: TEM (technical error of measurement); %CV (coefficient of variation); ICC (intra-class correlation coefficient). https://doi.org/10.1371/journal.pone.0271386.t004

Conclusions
In summary, the assignment of developmental stages is specific to each bone and is somewhat more problematic for the round bones (carpals) than for the long bones. Examiners should be encouraged to evaluate their reliability on perhaps 100 images spanning a broad range of CAs.
Estimates of data quality obtained in adolescent samples should not be generalized to younger ages. Finally, if disagreements between replicate assessments do not exceed one stage and are equally likely to be lower or higher than the initial assignments, the effect on assigned SAs appears to be trivial or small.